Estimating Code-Switching on Twitter with a Novel Generalized Word-Level Language Detection Technique
Abstract
Word-level language detection is necessary for analyzing code-switched text, where multiple languages may be mixed within a sentence. Existing models are restricted to code-switching between two specific languages and fail in real-world scenarios, since text input rarely comes with a priori information on the languages used. We present a novel unsupervised word-level language detection technique for code-switched text that handles an arbitrarily large number of languages and does not require any manually annotated training data. Our experiments with tweets in seven languages show a 74% relative error reduction in word-level labeling with respect to competitive baselines. We then use this system to conduct a large-scale quantitative analysis of code-switching patterns on Twitter, both global and region-specific, on 58M tweets.
Similar resources
Analysis and Prediction of Dutch-English Code-switching in Dutch Social Media Messages
Multilingual phenomena such as code-switching disturb widely used language interpretation tools, while the demand for such tools is rising due to the expanding worldwide popularity of online applications. This study explores code-switching between the lexically strongly related languages Dutch and English in Twitter messages. Contrary to similar studies on code-switching, the focus is centred on the ...
The Tel Aviv University System for the Code-Switching Workshop Shared Task
We describe our entry in the EMNLP 2014 code-switching shared task. Our system is based on a sequential classifier, trained on the shared training set using various character- and word-level features, some calculated using large monolingual corpora. We participated in the Twitter-genre Spanish-English track, obtaining an accuracy of 0.868 when measured on the tweet level and 0.858 on the word l...
Automatic Detection of Intra-Word Code-Switching
Many people are multilingual and they may draw from multiple language varieties when writing their messages. This paper is a first step towards analyzing and detecting code-switching within words. We first segment words into smaller units. Then, words are identified that are composed of sequences of subunits associated with different languages. We demonstrate our method on Twitter data in which...
Recurrent-Neural-Network for Language Detection on Twitter Code-Switching Corpus
Mixed-language data is one of the difficult yet less explored domains of natural language processing. Most research in fields like machine translation or sentiment analysis assumes monolingual input. However, people who are capable of using more than one language often communicate using multiple languages at the same time. Sociolinguists believe this “code-switching” phenomenon to be socially mo...
A Novel Generalized Topology for Multi-level Inverter with Switched Series-parallel DC Sources (RESEARCH NOTE)
This paper presents a novel topology of a single-phase multilevel inverter for low- and high-power applications. It consists of a polarity (level) generation circuit and an H-bridge. The proposed topology can produce higher output voltage levels by connecting DC voltage sources in series and parallel. The proposed topology utilizes a minimum number of power electronic devices, which helps in reduction o...